Leveraging the Performance of LBM-HPC for Large Sizes on GPUs using Ghost Cells
Today, we are witnessing a growing demand for larger and more efficient computational resources from the scientific community. At the same time, the advent of GPUs for general-purpose computing has been an important step toward covering that demand. These devices offer an impressive computational capacity at low cost and with efficient power consumption. However, the memory available on these devices is sometimes not enough, making computationally expensive memory transfers between CPU and GPU necessary and causing a dramatic fall in performance. Recently, the Lattice-Boltzmann Method has established itself as an efficient methodology for fluid simulations. Although this method presents some features particularly amenable to efficient exploitation on parallel computers, it requires a considerable memory capacity, which can be an important drawback, in particular on GPUs. In this paper, we propose a new GPU-based implementation that minimizes such requirements with respect to other state-of-the-art implementations. It allows us to execute problems almost 2× larger without additional memory transfers, achieving faster executions when dealing with large problems.
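The abstract's key idea is keeping large problems resident by exchanging only thin boundary layers. A minimal sketch of ghost-cell (halo) exchange for a 1D domain decomposition follows; all names are illustrative and this is not the paper's code:

```python
# Minimal sketch of ghost-cell (halo) exchange for a 1D decomposition.
# Only the thin ghost layers, not whole subdomains, would need to cross
# the CPU-GPU boundary between time steps.

def split_with_ghosts(field, parts):
    """Split `field` into `parts` subdomains, each padded with one
    ghost cell on each side holding a copy of the neighbour's edge."""
    n = len(field) // parts
    subs = []
    for p in range(parts):
        lo, hi = p * n, (p + 1) * n
        left = field[lo - 1] if lo > 0 else 0.0        # ghost from left neighbour
        right = field[hi] if hi < len(field) else 0.0  # ghost from right neighbour
        subs.append([left] + field[lo:hi] + [right])
    return subs

def exchange_ghosts(subs):
    """Refresh ghost cells from the neighbouring subdomains' edge values."""
    for p in range(len(subs)):
        if p > 0:
            subs[p][0] = subs[p - 1][-2]   # left ghost <- neighbour's last real cell
        if p < len(subs) - 1:
            subs[p][-1] = subs[p + 1][1]   # right ghost <- neighbour's first real cell

field = [float(i) for i in range(8)]
subs = split_with_ghosts(field, 2)
exchange_ghosts(subs)
```

In the GPU setting described above, `exchange_ghosts` is the only step that would touch the PCIe bus, which is why the technique helps with problems that barely fit in device memory.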
A Non-uniform Staggered Cartesian Grid approach for Lattice-Boltzmann method
We propose a numerical approach based on the Lattice-Boltzmann method (LBM) for dealing with mesh refinement on non-uniform staggered Cartesian grids. We explain, in detail, the strategy for mapping the LBM over such geometries. The main benefit of this approach, compared to others, is that all fluid units are solved only once per time step, which also considerably reduces the complexity of the communication and memory management between different refinement levels. It also maps better onto parallel processors. To validate our method, we analyze several standard test scenarios, reaching satisfactory results with respect to other state-of-the-art methods. The performance evaluation shows that our approach not only provides a simpler and more efficient scheme for dealing with mesh refinement, but also a fast resolution, even in scenarios where it needs to use a larger number of fluid units.
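The "once per time step" property can be pictured as a single flat collection of fluid units, each carrying its own cell size, swept by one update loop regardless of refinement level. The sketch below is purely illustrative (a toy BGK-style relaxation, not the paper's scheme):

```python
# Illustrative only: a flat list of fluid units, each with its own cell
# size `dx`, so one loop updates every unit exactly once per time step
# regardless of refinement level. The update is a toy BGK relaxation.

def step(cells, tau=0.6):
    """Advance every fluid unit once: relax the distribution `f`
    toward a local equilibrium `feq`, at the same cadence for all cells."""
    for c in cells:
        c["f"] += (c["feq"] - c["f"]) / tau

cells = [
    {"dx": 1.0, "f": 0.0, "feq": 1.0},   # coarse cell
    {"dx": 0.5, "f": 0.0, "feq": 1.0},   # refined cell, same update cadence
]
step(cells)
```

Contrast this with classical multi-grid LBM refinement, where fine levels are sub-cycled several times per coarse step and therefore need extra inter-level synchronization.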
LBM-HPC - An open-source tool for fluid simulations. Case study: Unified parallel C (UPC-PGAS)
The main motivation of this work is the evaluation of the Unified Parallel C (UPC) model for Boltzmann fluid simulations. UPC is one of the current models in the so-called Partitioned Global Address Space (PGAS) paradigm, which attempts to simplify codes while achieving better efficiency and scalability. Two different UPC-based implementations, explicit and implicit, are presented and evaluated. We compare the fundamental features of our UPC implementations with another parallel programming model, hybrid MPI-OpenMP. In particular, each of the major steps of any LBM code, i.e., boundary conditions, communication, and the LBM solver, is analyzed.
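The three major steps named in the abstract form the skeleton of essentially every parallel LBM code. A hedged structural sketch (function bodies are placeholders; the loop structure is the point):

```python
# Structural sketch of the per-iteration steps the abstract names for
# any LBM code. Bodies are placeholders, not the paper's implementation.

def apply_boundary_conditions(domain):
    domain["bc_applied"] = True

def communicate(domain):
    # In the UPC versions this step reads/writes the partitioned global
    # address space; in hybrid MPI-OpenMP it is explicit message passing.
    domain["halo_fresh"] = True

def lbm_solve(domain):
    # Collision + streaming would happen here.
    domain["step"] = domain.get("step", 0) + 1

def run(domain, steps):
    for _ in range(steps):
        apply_boundary_conditions(domain)
        communicate(domain)
        lbm_solve(domain)
    return domain

d = run({}, 3)
```

Profiling each of these three phases separately, as the study does, isolates where PGAS one-sided access helps or hurts relative to two-sided MPI messaging.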
Multi-domain grid refinement for lattice-Boltzmann simulations on heterogeneous platforms
The main contribution of the present work consists of several parallel approaches for grid refinement based on a multi-domain decomposition for lattice-Boltzmann simulations. The proposed method for discretizing the fluid incorporates several regular Cartesian grids with non-homogeneous spatial domains, which need to communicate with each other. Three parallel approaches are proposed: homogeneous multicore, homogeneous GPU, and heterogeneous multicore-GPU. Although the homogeneous implementations exhibit satisfactory results, the heterogeneous approach achieves up to 30% extra efficiency, in terms of Millions of Fluid Lattice Updates per Second (MFLUPS), by overlapping some of the steps on both architectures, multicore and GPU.
Towards HPC-Embedded Case Study: Kalray and Message-Passing on NoC
Today, one of the most important challenges in HPC is the development of computers with low power consumption. In this context, new embedded many-core systems have recently emerged. One of them is Kalray. Unlike other many-core architectures, Kalray is not a co-processor (it is self-hosted). One interesting feature of the Kalray architecture is its Network on Chip (NoC) interconnect. Typically, communication in many-core architectures is carried out via shared memory, but in Kalray the processing elements can also communicate via message passing over the NoC. One of the main motivations of this work is to present the main constraints of dealing with the Kalray architecture; in particular, we focus on memory management and communication. We assess the use of the NoC and of shared memory on Kalray. Unlike shared memory, the implementation of message passing on the NoC is not transparent from the programmer's point of view, and the synchronization between processing elements and the NoC is another of the challenges of the Kalray processor. Although synchronization using message passing is more complex and time-consuming than using shared memory, we obtain an overall speedup close to 6 when using message passing on the NoC with respect to using shared memory. Additionally, we have measured the power consumption of both approaches. Despite being faster, the NoC approach draws about 50% more power in Watts than the approach that exploits shared memory. However, the reduction in execution time achieved by using the NoC has an important positive impact on the overall energy consumption as well.
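The contrast the abstract draws is between implicit sharing and explicit send/receive, where synchronization travels with the message itself. A hypothetical sketch (this is not Kalray's API; a thread-safe queue stands in for a NoC channel between two processing elements):

```python
# Hypothetical illustration, not Kalray's API: explicit message passing
# between two "processing elements" (threads), with a queue standing in
# for a NoC channel. Synchronization is carried by the messages.

import threading, queue

def producer(chan, items):
    for x in items:
        chan.put(x)          # explicit send over the "NoC" channel
    chan.put(None)           # end-of-stream marker

def consumer(chan, out):
    while True:
        x = chan.get()       # explicit receive; blocking here replaces
        if x is None:        # the locks/flags a shared-memory version needs
            break
        out.append(x * 2)

chan = queue.Queue()
out = []
t1 = threading.Thread(target=producer, args=(chan, [1, 2, 3]))
t2 = threading.Thread(target=consumer, args=(chan, out))
t1.start(); t2.start(); t1.join(); t2.join()
```

In the shared-memory alternative, both sides would poll or lock a common buffer, which is simpler to write but, per the measurements above, slower on this architecture.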
Accelerating solid-fluid interaction using Lattice-Boltzmann and Immersed Boundary coupled simulations on heterogeneous platforms
We propose a numerical approach based on the Lattice-Boltzmann (LBM) and Immersed Boundary (IB) methods to tackle the problem of the interaction of solids with an incompressible fluid flow. The proposed method uses a uniform Cartesian grid that incorporates both the fluid and the solid domain; this is an efficient and novel way to address the problem and a growing research topic in Computational Fluid Dynamics. We explain in detail the parallelization of the whole method on both GPUs and a heterogeneous GPU-multicore platform and describe different optimizations, focusing on memory management and CPU-GPU communication. Our performance evaluation consists of a series of numerical experiments that simulate situations of industrial and research interest. Based on these tests, we show that the baseline LBM implementation achieves satisfactory results on GPUs. However, when coupling the LBM and IB methods on GPUs, the overheads of the IB method degrade the overall performance. As an alternative, we explore a heterogeneous implementation that is able to hide such overheads and allows us to exploit both multicore and GPU resources in a cooperative way.
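The overhead-hiding idea can be sketched as running the IB work concurrently with the fluid update, so the slower part overlaps with the faster one. The names below are hypothetical stand-ins (plain functions, not the actual GPU kernels):

```python
# Sketch of the cooperative CPU-GPU idea, with hypothetical stand-ins:
# launch the fluid update and the IB correction concurrently so the IB
# overhead is hidden behind the LBM step.

from concurrent.futures import ThreadPoolExecutor

def lbm_fluid_step(state):
    """Stands in for the GPU fluid kernel."""
    return state + 1

def ib_correction(state):
    """Stands in for the multicore Immersed Boundary work."""
    return state * 0.0       # placeholder correction

def coupled_step(fluid, solid):
    with ThreadPoolExecutor(max_workers=2) as pool:
        f = pool.submit(lbm_fluid_step, fluid)   # runs concurrently...
        s = pool.submit(ib_correction, solid)    # ...with this
        return f.result(), s.result()

fluid, solid = 0, 1.0
for _ in range(4):
    fluid, corr = coupled_step(fluid, solid)
```

When the IB correction finishes inside the fluid step's runtime, the coupled iteration costs no more than the fluid update alone, which is the effect the heterogeneous version exploits.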
Fast finite difference Poisson solvers on heterogeneous architectures
In this paper we propose and evaluate a set of new strategies for the solution of three-dimensional separable elliptic problems on CPU–GPU platforms. The numerical solution of the system of linear equations arising when discretizing these operators often represents the most time-consuming part of larger simulation codes tackling a variety of physical situations. Incompressible fluid flows, electromagnetic problems, heat transfer, and solid mechanics simulations are just a few examples of application areas that require efficient solution strategies for this class of problems. GPU computing has emerged as an attractive alternative to conventional CPUs for many scientific applications. High speedups over CPU implementations have been reported, and this trend is expected to continue in the future with improved programming support and tighter CPU–GPU integration. These speedups by no means imply that CPU performance is no longer critical. The conventional CPU-control/GPU-compute pattern used in many applications wastes much of the CPU's computational power. Our proposed parallel implementation of a classical cyclic reduction algorithm to tackle the large linear systems arising from the discretized form of the elliptic problem at hand schedules computing on both the GPU and the CPUs in a cooperative way. The experimental results demonstrate the effectiveness of this approach.
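Cyclic reduction, the classical algorithm named above, halves the number of unknowns at each level by eliminating every other equation, which is what makes it attractive for parallel hardware: each level's eliminations are independent. A minimal serial sketch for a tridiagonal system (illustrative; the paper's version is a parallel 3D solver):

```python
# Minimal serial sketch of cyclic reduction for a tridiagonal system
# (the paper parallelizes this idea across CPU and GPU). Assumes
# n = 2**k - 1 unknowns, with a[0] = 0 and c[-1] = 0.

def cyclic_reduction(a, b, c, d):
    """Solve the tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c, and right-hand side d. Returns the solution x."""
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    s, strides = 1, []
    while s < n:                       # forward reduction levels
        strides.append(s)
        for i in range(2 * s - 1, n, 2 * s):
            alpha = a[i] / b[i - s]
            gamma = c[i] / b[i + s] if i + s < n else 0.0
            d[i] -= alpha * d[i - s] + (gamma * d[i + s] if i + s < n else 0.0)
            b[i] -= alpha * c[i - s] + (gamma * a[i + s] if i + s < n else 0.0)
            a[i] = -alpha * a[i - s]   # now couples x[i] to x[i - 2s]
            c[i] = -gamma * c[i + s] if i + s < n else 0.0
        s *= 2
    x = [0.0] * n
    for s in reversed(strides + [s]):  # back substitution, coarse to fine
        for i in range(s - 1, n, 2 * s):
            left = a[i] * x[i - s] if i - s >= 0 else 0.0
            right = c[i] * x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - left - right) / b[i]
    return x

# 1D Laplacian test: A = tridiag(-1, 2, -1), exact solution 1..7.
x = cyclic_reduction([0, -1, -1, -1, -1, -1, -1],
                     [2, 2, 2, 2, 2, 2, 2],
                     [-1, -1, -1, -1, -1, -1, 0],
                     [0, 0, 0, 0, 0, 0, 8])
```

Within each forward level, every reduced row depends only on rows from the previous level, so all of them can be eliminated simultaneously; that independence is what a CPU–GPU cooperative schedule divides between the two devices.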
Many-task computing on many-core architectures
Many-Task Computing (MTC) is a common scenario for multiple parallel systems, such as clusters, grids, clouds, and supercomputers, but it is not so popular on shared-memory parallel processors. In this sense, and given the spectacular growth in performance and in the number of cores integrated into many-core architectures, the study of MTC on such architectures is becoming more and more relevant. In this paper, we present the programming mechanisms that can take advantage of such massively parallel features for the particular target of MTC. The hardware features of the two dominant many-core platforms (NVIDIA's GPUs and Intel Xeon Phi) are also analyzed for our specific framework. Given the important differences in terms of hardware and software between our two many-core platforms, we have considered different strategies based on CUDA (for GPUs) and OpenMP (for Intel Xeon Phi). We carried out several test cases based on an appropriate and widely studied benchmarking problem, matrix multiplication. Essentially, this study consisted of comparing the time consumed when computing a set of tasks in parallel one by one (the whole computational resources are used to compute a single task at a time) with the time consumed when computing the same set of tasks simultaneously (the whole computational resources are shared among all the tasks at the very same time). Finally, we compared both software-hardware scenarios to identify the most relevant computer features in each of our many-core architectures.
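The two scheduling scenarios the study compares can be sketched as follows. This is an illustrative stand-in (a scalar reduction instead of matrix multiplication, threads instead of CUDA/OpenMP); both scenarios must of course produce identical results, only their timing differs:

```python
# Illustrative sketch of the two MTC scheduling scenarios compared in
# the study (a scalar kernel stands in for matrix multiplication).

from concurrent.futures import ThreadPoolExecutor

def task(n):
    """Stand-in compute kernel; the study used matrix multiplication."""
    return sum(i * i for i in range(n))

tasks = [1000, 2000, 3000, 4000]

# Scenario 1: one task at a time (all resources devoted to each task).
one_by_one = [task(n) for n in tasks]

# Scenario 2: all tasks at the same time (resources shared among tasks).
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    simultaneous = list(pool.map(task, tasks))
```

On real hardware the interesting question is which scenario finishes the whole set sooner: per-task parallelism shines when one task can saturate the device, while task-level concurrency wins when individual tasks leave cores idle.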
Many Neglected Tropical Diseases May Have Originated in the Paleolithic or Before: New Insights from Genetics
The standard view of modern human infectious diseases is that many of them arose during the Neolithic when animals were first domesticated, or afterwards. Here we review recent genetic and molecular clock estimates that point to a much older Paleolithic origin (2.5 million years ago to 10,000 years ago) of some of these diseases. During part of this ancient period our early human ancestors were still isolated in Africa. We also discuss the need for investigations of the origin of these diseases in African primates and other animals that have been the original source of many neglected tropical diseases